@chinmaydk99

This PR fixes critical bugs in PyTorch Inductor's multi-kernel support for ROCm/HIP, enabling dynamic shape-based kernel selection for Triton templates. The changes resolve both compilation failures and runtime crashes when using config.multi_kernel_hints.

UTs being targeted:

  1. test/inductor/test_multi_kernel.py - test_triton_gemm
  2. test/inductor/test_multi_kernel.py - test_triton_relu_fused_gemm
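A minimal sketch of exercising the targeted path with `config.multi_kernel_hints` follows; the hint values and the autotune backend setting are assumptions for illustration, not the PR's test code:

```python
import torch
import torch._inductor.config as inductor_config

# Placeholder hint sizes; the PR only states that config.multi_kernel_hints
# drives dynamic shape-based kernel selection for Triton templates.
inductor_config.multi_kernel_hints = [64, 256, 4096]
inductor_config.max_autotune_gemm_backends = "TRITON"  # assumed, to force Triton templates

@torch.compile(mode="max-autotune")
def relu_gemm(a, b):
    return torch.relu(a @ b)

if torch.cuda.is_available():  # covers ROCm/HIP builds as well
    a = torch.randn(256, 256, device="cuda", dtype=torch.float16)
    b = torch.randn(256, 256, device="cuda", dtype=torch.float16)
    out = relu_gemm(a, b)
```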

pytorchmergebot and others added 30 commits November 18, 2025 17:20
…7661)"

This reverts commit 1b43d6c.

Reverted pytorch#167661 on behalf of https://github.com/yangw-dev due to break internal tests and build, please reach out meta fellas to have fix it and reland again, error example: hip/KernelUtils.cuh:74:5: error: no matching function for call to 'unsafeAtomicAdd' ([comment](pytorch#167661 (comment)))
Summary: The export_memory_timeline method in torch.profiler is being deprecated in favor of the newer memory snapshot API (torch.cuda.memory._record_memory_history and torch.cuda.memory._export_memory_snapshot). This change adds the deprecated decorator from typing_extensions and updates the docstring to guide users to the recommended alternative. The decorator will emit a FutureWarning at runtime, and the docstring now includes a .. deprecated:: directive for documentation visibility.
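For readers migrating off export_memory_timeline, a short sketch of the recommended snapshot API named above (output file name is a placeholder):

```python
import torch

# Start recording allocator events (stack traces are captured by default).
torch.cuda.memory._record_memory_history(max_entries=100_000)

# ... run the workload whose memory usage you want to inspect ...
x = torch.randn(1024, 1024, device="cuda")
y = x @ x

# Dump the snapshot (viewable at pytorch.org/memory_viz), then stop recording.
torch.cuda.memory._export_memory_snapshot("memory_snapshot.pickle")
torch.cuda.memory._record_memory_history(enabled=None)
```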

Test Plan: Manual verification that the decorator is properly applied and the deprecation message is informative.

Differential Revision: D87272399

Pull Request resolved: pytorch#168036
Approved by: https://github.com/valentinandrei
This PR introduces a `Tensor` subclass which represents a complex tensor in terms of two real ones. Ops are decomposed as individual ops  on the real and imaginary parts.

It is compatible with `torch.compile`, so long as the real ops used are also compatible. Autograd "works", but is WIP due to different edge-case behaviour.
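As a rough illustration of the decomposition idea (not the PR's actual subclass or machinery), complex multiplication can be expressed purely as real-tensor ops:

```python
import torch

class TwoRealComplex:
    """Toy stand-in: a complex value held as two real tensors."""

    def __init__(self, re: torch.Tensor, im: torch.Tensor):
        self.re, self.im = re, im

    def __mul__(self, other: "TwoRealComplex") -> "TwoRealComplex":
        # (a + bi)(c + di) = (ac - bd) + (ad + bc)i, using only real ops,
        # so a compiler tracing this only ever sees real-tensor operations.
        return TwoRealComplex(
            self.re * other.re - self.im * other.im,
            self.re * other.im + self.im * other.re,
        )

a = TwoRealComplex(torch.randn(3), torch.randn(3))
b = TwoRealComplex(torch.randn(3), torch.randn(3))
c = a * b
```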
Pull Request resolved: pytorch#167621
Approved by: https://github.com/ezyang
Repetition of pytorch#155708
It has been broken for a while, and the ET pin in PyTorch is so old that `torch==2.10.0.dev20250915` could no longer be found in the nightly indices.
Pull Request resolved: pytorch#168090
Approved by: https://github.com/atalman, https://github.com/yangw-dev
This PR enables special matmuls on Thor devices. This includes row-wise scaled matmul on `fp8` and group gemm on `bfloat16`.
Pull Request resolved: pytorch#164836
Approved by: https://github.com/ngimel
…orch#167395)

This adds a debug HTTP server for debugging stuck or slow jobs. It runs the WorkerServer on every worker and then launches a separate Flask process on rank 0 for users to connect to for debugging.

This can easily be improved to trigger profilers as well as visualize the data much better.

Initial handlers:
* pytorch profiler
* FlightRecorder data
* Python stacks

```
import os

# Size the NCCL FlightRecorder trace buffer so the debug server has data to serve.
os.environ["TORCH_NCCL_TRACE_BUFFER_SIZE"] = "2000"

from torch.distributed.debug import enable_debug_server

enable_debug_server()
```

Test plan:

```
torchrun --nnodes 1 --nproc_per_node=gpu ~/scripts/debug_test.py
```

<img width="2000" height="1045" alt="20251117_16h58m18s_grim" src="https://github.com/user-attachments/assets/82305b75-227c-4412-a481-00b622db6bd1" />
<img width="2172" height="1624" alt="20251117_16h58m11s_grim" src="https://github.com/user-attachments/assets/def9841c-c7e6-483a-81c3-cf0c56f6bad8" />
<img width="1985" height="1635" alt="20251117_16h58m03s_grim" src="https://github.com/user-attachments/assets/04fcf148-df58-41b4-8754-8706ee0d1de6" />

Pull Request resolved: pytorch#167395
Approved by: https://github.com/fduwjj, https://github.com/malfet, https://github.com/atalman
…ytorch#167079)

Summary:
As title.

Knowing the size of the leaked tensor is useful, it allows us to focus on the largest leaks.

Differential Revision: D86218574

Pull Request resolved: pytorch#167079
Approved by: https://github.com/kausv
…torch#161703)

It's another PR to port the distributed tensor tests to Intel GPU; the other PR is pytorch#161604.
We enable Intel GPU with the following methods while trying our best to keep the original code style (a brief sketch follows this list):

- Use torch.accelerator for generic GPU handling
- Skip cases with known issues when running on XPU
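A hedged sketch of that pattern; the test body is a placeholder, not one of the ported tests:

```python
import unittest
import torch

# torch.accelerator abstracts over CUDA/XPU/etc.; fall back to CPU when absent.
acc = torch.accelerator.current_accelerator() if torch.accelerator.is_available() else None
device_type = acc.type if acc is not None else "cpu"

class DeviceGenericSmokeTest(unittest.TestCase):
    @unittest.skipIf(device_type == "xpu", "known issue on XPU")
    def test_add(self):
        x = torch.ones(4, device=device_type)
        torch.testing.assert_close(x + x, torch.full((4,), 2.0, device=device_type))
```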

Pull Request resolved: pytorch#161703
Approved by: https://github.com/guangyey, https://github.com/d4l3k, https://github.com/albanD
The all-gather bucketing already went part of the way toward fusing dtype casts into the bucket. We do this by allocating the group bucket buffer, then viewing each slice of it as the destination dtype. We then foreach_copy_ into the allocated buffer, with each collective copying into its destination dtype.

This logic was causing an issue in a later part of the stack but was not fully firing, so we might as well fix it.

Note: custom ops don't yet support list[dtype], so I worked around it with list[int]; this will be fixed in a follow-up.
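A standalone sketch of the buffer-view-copy pattern described above (shapes, dtypes, and the byte-typed bucket are assumptions; the real code operates on collective inputs inside the bucketing pass):

```python
import torch

inputs = [torch.randn(16, dtype=torch.float32), torch.randn(8, dtype=torch.float32)]
dest_dtypes = [torch.bfloat16, torch.float16]

# Size one flat byte buffer so each slice can be reinterpreted as its destination dtype.
nbytes = [t.numel() * d.itemsize for t, d in zip(inputs, dest_dtypes)]
bucket = torch.empty(sum(nbytes), dtype=torch.uint8)

views, offset = [], 0
for t, d, nb in zip(inputs, dest_dtypes, nbytes):
    views.append(bucket[offset:offset + nb].view(d))  # view the byte slice as the destination dtype
    offset += nb

# One fused copy; each underlying copy_ casts its input into the destination dtype.
torch._foreach_copy_(views, inputs)
```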

Pull Request resolved: pytorch#167853
Approved by: https://github.com/ruisizhang123
ghstack dependencies: pytorch#167852
The bucketing dtype fusing was causing nodes which had dependencies to be erased. Transfer those deps over to the new nodes, and also add an assertion that none of our deps are erased to catch this type of error in the future.

Pull Request resolved: pytorch#167863
Approved by: https://github.com/fmassa
ghstack dependencies: pytorch#167852, pytorch#167853
Since the currently intended workflow on the new MI3xx CI capacity is [trunk-rocm-mi300.yml](https://github.com/pytorch/pytorch/blob/d91269e8ce309437c1f849b5ab3362d69b178ef4/.github/workflows/trunk-rocm-mi300.yml#L54), which only needs the jammy images, we limit docker caching to those images to optimize caching times.

Pull Request resolved: pytorch#168088
Approved by: https://github.com/jeffdaily
For GPU: It was previously reported that only a single sample could be tested with the huber_loss functional. The current snapshot of the code does not appear to suffer from the numerical issues reported before.

For CPU: While testing GPU, it was discovered that the Half computation appears to be numerically unstable. This commit resolves the CPU issue by upcasting Half to float for the computation.
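A hedged, eager-mode sketch of the upcast-for-compute idea (this is not the ATen kernel change itself):

```python
import torch

def huber_loss_half_stable(inp: torch.Tensor, target: torch.Tensor, delta: float = 1.0) -> torch.Tensor:
    # Do the arithmetic in float32 when the inputs are Half, then cast the result back.
    compute_dtype = torch.float32 if inp.dtype == torch.half else inp.dtype
    x, y = inp.to(compute_dtype), target.to(compute_dtype)
    diff = (x - y).abs()
    loss = torch.where(diff < delta, 0.5 * diff * diff, delta * (diff - 0.5 * delta))
    return loss.mean().to(inp.dtype)

a = torch.randn(32, dtype=torch.half)
b = torch.randn(32, dtype=torch.half)
print(huber_loss_half_stable(a, b))
```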

Pull Request resolved: pytorch#166952
Approved by: https://github.com/benjaminglass1, https://github.com/isuruf
…h/csrc/Exceptions.h (pytorch#168056)

Summary:
`-Wunused-exception-parameter` has identified an unused exception parameter. This diff removes it.

This:
```
try {
    ...
} catch (exception& e) {
    // no use of e
}
```
should instead be written as
```
} catch (exception&) {
```

If the code compiles, this is safe to land.

Test Plan: Sandcastle

Reviewed By: dtolnay

Differential Revision: D87273132

Pull Request resolved: pytorch#168056
Approved by: https://github.com/malfet, https://github.com/Skylion007
Summary: If the Tensor has a PyObject, its use count will now be two instead of one.

Test Plan: `buck test -j 18 fbcode//mode/dev-nosan fbcode//caffe2/test:torch`

Differential Revision: D87297965

Pull Request resolved: pytorch#168060
Approved by: https://github.com/albanD, https://github.com/Skylion007
As the compiler has not been supported for the last 3 years, and all manylinux2_28 builds should have at least gcc-11.

Prep change for C++20 standard migration
Pull Request resolved: pytorch#167933
Approved by: https://github.com/yangw-dev, https://github.com/atalman
ghstack dependencies: pytorch#168090
…rch#168104)

We only want to cache the latest CI docker image for the `main` and `release` branches in cases where multiple `docker-builds` workflow runs get triggered in quick succession. This is because the latest run will overwrite the cached images anyway, since we do not maintain a cached image per SHA; there is only one per branch (to minimize cache size and docker load times at runner bringup).

Also removing `workflow_dispatch` as a trigger since it won't work (needs artifacts from `docker-builds` run)

Pull Request resolved: pytorch#168104
Approved by: https://github.com/jeffdaily
Fixes pytorch#167905

The typo correction below has been made.

Existing comment:
// List of Any can contains heterogenous types

Suggested comment:
// List of Any can contains heterogeneous types
Pull Request resolved: pytorch#167907
Approved by: https://github.com/albanD
Unclear which PR in the ghstack caused the ROCm failure. Stack was (oldest at bottom):
 - pytorch#167962
 - pytorch#167804
 - pytorch#167803
 - pytorch#167802
 - pytorch#168025

Fixes the following test:

```
PYTORCH_TEST_WITH_ROCM=1 python test/cpp_extensions/libtorch_agnostic_2_10_extension/test_version_compatibility.py FunctionVersionCompatibilityTest.test_mv_tensor_accessor_cuda_works_with_2_9
```

Pull Request resolved: pytorch#168087
Approved by: https://github.com/jeffdaily, https://github.com/janeyx99

Co-authored-by: Jeff Daily <[email protected]>
Co-authored-by: Jane (Yuan) Xu <[email protected]>
Fixes a false negative (the illusion that "all B200 periodic nvshmem-triton tests passed").

Pull Request resolved: pytorch#167760
Approved by: https://github.com/ngimel
# Motivation
This is definitely a bug: we were attempting to release cached memory back to the system without proper **synchronization**. Callers must ensure that all accesses to memory blocks allocated by SYCL APIs have completed before invoking `sycl::free`.

For a simple example, in the following code:
```python
pool = torch.xpu.MemPool()
with torch.xpu.use_mem_pool(pool):
    input = torch.randn(100, device='xpu')
sum = input.sum()
del pool
print(sum)
```
`sum` may exhibit undefined behavior because `input.sum()` might not have finished executing before `del pool` triggers `input`'s memory release.

With this fix, we ensure that all kernels on the associated streams complete before the memory pool is destroyed, guaranteeing that `sum` holds the correct value.
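With the fix, the snippet above prints the expected value; on builds without it, the same guarantee can be obtained manually (a sketch reusing the example above):

```python
import torch

pool = torch.xpu.MemPool()
with torch.xpu.use_mem_pool(pool):
    input = torch.randn(100, device="xpu")
total = input.sum()
torch.xpu.synchronize()  # ensure input.sum() has finished before the pool's blocks are freed
del pool
print(total)
```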

# Solution
Because `c10::xpu::syncStreamsOnDevice` has host overhead, we use a boolean flag `streams_synced` to ensure it is called only once.
Pull Request resolved: pytorch#168074
Approved by: https://github.com/EikanWang
…6833)

The implementation plan of MemPool for XPU, which is a dependency of [XPUGraph](pytorch#166285), following the [RFC](pytorch#162143).

- [ ] pytorch#166831
- [ ] ->pytorch#166833
- [ ] pytorch#166843
Pull Request resolved: pytorch#166833
Approved by: https://github.com/EikanWang, https://github.com/gujinghui
Summary: Fix pytorch#167630. There was a reference cycle between GraphLowering and CppWrapperCpu due to caching, which made GraphLowering unnecessarily hold some constant tensors, causing GPU memory leaks. This PR fixes that by changing the cache to use the object id of GraphLowering as part of the key.
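A minimal sketch of the keying change (class and function names here are placeholders, not the actual Inductor code):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def _codegen_cached(graph_id: int, key: str) -> str:
    # Expensive work keyed by the owning graph's id() rather than the graph object
    # itself, so the cache no longer keeps GraphLowering (and its constant tensors) alive.
    return f"// generated for graph {graph_id}: {key}"

class WrapperCodegen:
    def codegen(self, graph, key: str) -> str:
        return _codegen_cached(id(graph), key)
```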

Pull Request resolved: pytorch#168063
Approved by: https://github.com/yushangdi
Fixes #ISSUE_NUMBER

Pull Request resolved: pytorch#168111
Approved by: https://github.com/ezyang
…torch#166273)

Partially vibe-coded with ClaudeCode, and changes following ops (summary also created by Claude):
- **Activation operations**: Added checks rejecting Long, Complex, and Bool types for operations like log_softmax, log_sigmoid, mish, softplus, and silu, as MPS doesn't support exponent operations on these types

- **Linear algebra operations**: Restricted linalg_lu_factor, linalg_solve, and linalg_solve_triangular to Float type only (previously only checked for complex types)

- **Pooling operations**: Added checks to reject Complex types for avg_pool2d and max_pool2d operations

- **Loss functions**: Added type checks for nll_loss (Complex), huber_loss (Long, Complex), and grid_sampler_2d (Complex)

- **Reduction operations**:
  - Fixed NANSUM to handle integral types correctly (can't contain NaN, so just performs regular sum)
  - Added Long type check for std/var operations

- **Other operations**:
  - softmax: Now explicitly requires floating point types
  - bincount: Rejects Bool type to prevent crashes

All checks use `TORCH_CHECK_NOT_IMPLEMENTED`
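The user-visible effect (a hedged sketch; exact error type and message may differ) is an explicit not-implemented failure instead of a crash, e.g. for bincount on Bool:

```python
import torch

if torch.backends.mps.is_available():
    x = torch.ones(8, dtype=torch.bool, device="mps")
    try:
        torch.bincount(x)  # Bool input is now rejected up front
    except (NotImplementedError, RuntimeError) as err:
        print("rejected:", err)
```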
Pull Request resolved: pytorch#166273
Approved by: https://github.com/manuelcandales
Summary: Shrink binary size to reduce relocation overflows. The most important change is to split `intrusive_ptr::reset_()` into two functions and mark the bigger one as `C10_NOINLINE`.

Differential Revision: D87308588

Pull Request resolved: pytorch#168080
Approved by: https://github.com/albanD, https://github.com/Skylion007, https://github.com/malfet, https://github.com/ezyang
# Motivation
Thanks to @KarhouTam for finding the issue mentioned in pytorch#167172
This PR aims to improve the build logic in activities for kineto.

# Additional Context
Fix pytorch#167172

Pull Request resolved: pytorch#167204
Approved by: https://github.com/EikanWang, https://github.com/ezyang
pytorchupdatebot and others added 17 commits November 19, 2025 06:11
This PR adds support for effectful ops within invoke_subgraphs.
* Most of the logic is in `invoke_subgraph.py_functionalize_impl`.
  * In the functionalization metadata collection phase, we note the tokens before going further down the dispatcher, and then note the tokens after coming back from the dispatcher. If there are nodes in the invoke_subgraph subgraph that contain effects, the number of effects should change, or the tokens used for an effect should.
  * We will store this effect difference in the `InvokeSubgraphCache` where the key is the identifier and value is the effect. For now we only support one effect within a subgraph.
  * During the tracing part of AOTAutograd, we will then wrap the subgraph to take in and output a token.

Before:
```
def forward(self, x):
    repeated_subgraph0 = self.repeated_subgraph0
    invoke_subgraph = torch.ops.higher_order.invoke_subgraph(repeated_subgraph0, 'subgraph_0', x)
    return invoke_subgraph

def repeated_subgraph(self, x):
    record_memory = torch.ops.mylib.record_memory.default("forward", "N")
    add = torch.ops.aten.add(x, x)
    return add
```
After:
```
def forward(self, token, x):
    repeated_subgraph0 = self.repeated_subgraph0
    invoke_subgraph = torch.ops.higher_order.invoke_subgraph(repeated_subgraph0, 'subgraph_0', token, x)
    getitem = invoke_subgraph[0]  # output token
    getitem_1 = invoke_subgraph[1]
    return (getitem, getitem_1)

def repeated_subgraph(self, token, x):
    with_effects = torch.ops.higher_order.with_effects(token, torch.ops.mylib.record_memory.default, 'forward', 'N')
    getitem = with_effects[0]  # output token
    add = torch.ops.aten.add(x, x)
    return  (getitem, add)
```

* Then there is a bunch of logic within `_remove_effect_tokens` to handle removing the effects from the invoke_subgraph subgraph

Differential Revision: [D87392741](https://our.internmc.facebook.com/intern/diff/D87392741)
Pull Request resolved: pytorch#167231
Approved by: https://github.com/anijain2305
…torch#167245)

In the [previous PR](https://github.com/pytorch/pytorch/pull/167231/files#diff-e2b74af5d8b538a7d07d18507d27010703742ddad5f819992b55f5abc6d9a502R964-R966) we found that the autograd eager impl of invoke_subgraph calls the subgraph twice. If the subgraph contains effects then effects will be run twice, which is bad. This PR fixes the issue by getting the output metadata from `subgraph`'s `node.meta` if it exists.

Differential Revision: [D87392740](https://our.internmc.facebook.com/intern/diff/D87392740)
Pull Request resolved: pytorch#167245
Approved by: https://github.com/anijain2305
ghstack dependencies: pytorch#167231
`fully_shard`'s `gradient_divide_factor` isn't currently respected when the sharding degree = 1. This PR ensures the division factor applies also in this case.

This is a bit of an edge case, but it arises in `torchtitan`, e.g. with expert parallelism and `ep_degree=world_size` we still wrap the routed experts in `fully_shard` because:
1) It lets us take advantage of its mixed-precision mechanisms.
2) [A specific gradient_divide_factor is needed for correctness](https://github.com/pytorch/torchtitan/blob/176498cd4edd4d80e95959a618279681f8295f4c/torchtitan/models/llama4/infra/parallelize.py?plain=1#L364-L369)

This PR ensures correctness in the `reduce_scatter_group.size()==1` case.

Reproducer and sample failures are in the [gist here](https://gist.github.ibm.com/goon/f67e7559284cc2d322faff1ac59fe382). The net effect is that the EP grads are too large by a factor of the world size in the case described above. I checked that the proposed fix makes these tests pass.

I guess I should add a test for this, too?

Pull Request resolved: pytorch#167178
Approved by: https://github.com/weifengpy
…efs_fft_irfftn_cuda_complex64` (pytorch#168016)

Otherwise we see e.g.,
```
Mismatched elements: 1 / 40320 (0.0%)
Greatest absolute difference: 0.0001373291015625 at index (0, 4, 0, 2, 3, 5) (up to 0.0001 allowed)
Greatest relative difference: 1.633889951335732e-05 at index (0, 4, 0, 2, 3, 5) (up to 1.3e-06 allowed)
```

Pull Request resolved: pytorch#168016
Approved by: https://github.com/nWEIdia, https://github.com/ezyang
This is the necessary fix for meta-pytorch/autoparallel#256.

### Issue:
When we call `_clear_fast_path_sharding_prop_cache()` and then `get_thread_local_native_sharding_propagator_cache()`, the code gets stuck due to a deadlock.

### Cause:
When you assign to a Python dict key that already exists:
```C++
thread_dict["__DTensor_fastpath_thread_cache_cleanup"] = old_capsule  // capsule ROCm#1 stored
...
clear_DTensor_sharding_propagator_cache() // call to clean up the cache
...
get_thread_local_native_sharding_propagator_cache() {
  std::lock_guard<std::mutex> lock(
        native_sharding_propagator_cache_cleanup_mutex);  // FIRST claims the lock!
  if (!native_sharding_propagator_cache_DO_NOT_USE.has_value()) { // enter this again because we have cleared the cache.
    ...
    // Destroys old_capsule FIRST then stores new_capsule. However, where we destroy the old_capsule,
    // it will trigger the destructor to claim `native_sharding_propagator_cache_cleanup_mutex` again!
    thread_dict["__DTensor_fastpath_thread_cache_cleanup"] = new_capsule  // SECOND claims the lock before FIRST releases
  }
}
```
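The same re-entrant-destructor pattern, sketched in Python with a non-reentrant lock (purely illustrative; the actual code is the C++ above):

```python
import threading

lock = threading.Lock()          # non-reentrant, like std::mutex
thread_dict = {}

class CleanupCapsule:
    def __del__(self):
        with lock:               # the destructor claims the same lock
            pass

def rebuild_cache():
    with lock:                                   # FIRST claims the lock
        # Assigning over an existing key destroys the old capsule while the lock
        # is still held; its __del__ then tries to claim the lock again.
        thread_dict["cleanup"] = CleanupCapsule()

rebuild_cache()    # fine the first time: there is no old capsule to destroy
# rebuild_cache()  # would deadlock: the old capsule's __del__ blocks on the held lock
```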

Pull Request resolved: pytorch#168069
Approved by: https://github.com/ezyang
# Motivation
This PR aims to fix the bug that the moved-to object's `external_` member is not assigned correctly.

# Additional Context
It's not correct to swap the valid and invalid values; we just need to prevent a double free.

Pull Request resolved: pytorch#167711
Approved by: https://github.com/albanD
Adds support for randomness in LocalTensor. tl;dr: it needs to be able to handle RNG the same way (i.e., RNG tracking/syncing across shards, user-defined seeds, user-defined generators, etc.).

We extend the existing OffsetBasedRNGTracker to play nicely with LocalTensor's setup, creating a few small subclasses and patching the core RNG logic to manage the per-rank seeds and offsets correctly.

Per-rank generator support is still not done (since the existing tests imply a globally-seeded generator), but it is something that should be done.

Pull Request resolved: pytorch#166540
Approved by: https://github.com/dzmitry-huba
…rch#166395)

Implement a complete OpenRegDeviceAllocator with the following enhancements:

- Implement memory statistics tracking (allocated/reserved bytes, allocation count)
   - Track allocation sizes for accurate memory statistics
- Refactor DeviceAllocator's inheritance relationship from c10::DeviceAllocator
  - This change is for further improvement of adding a memory caching function to DeviceMemory

Add comprehensive test coverage:
- Memory allocation/deallocation tests with statistics validation
- Storage operations and tensor-from-blob tests
- Multithreading safety tests for concurrent allocations
- Gradient tracking and requires_grad compatibility tests

Fixes pytorch#166157

Pull Request resolved: pytorch#166395
Approved by: https://github.com/fffrog
…extension (pytorch#167855)

Address Nikita's offline comment on pytorch#167496

Pull Request resolved: pytorch#167855
Approved by: https://github.com/janeyx99
ghstack dependencies: pytorch#167496
…#168049)"

This reverts commit 8cb8b6c.

Reverted pytorch#168049 on behalf of https://github.com/yangw-dev due to D87346992 internal error that conflict the main branch, please rebase and try to merge again These changes have conflicts when merging with master branch. Rebase this diff. ([comment](pytorch#168049 (comment)))
…)"

This reverts commit db1551b.

Reverted pytorch#168024 on behalf of https://github.com/yangw-dev due to Internal merge fail, These changes have conflicts when merging with master branch. Rebase this diff. please rebase the pr and try merge again ([comment](pytorch#168024 (comment)))
…cts (pytorch#168149)

grad_placements is a sequence-like data structure and can therefore be a UserDefinedObject. In that case, we can extract the tuple and pass it along.

Pull Request resolved: pytorch#168149
Approved by: https://github.com/bdhirsh